library(readxl)
library(skimr)
setwd("/Users/himanshusahrawat/MBA projects/CPS/R/")
data=read_excel("Central parking dataset.xls", sheet = 1)
New names:
data_s=read_excel("Central parking dataset.xls", sheet = 2)
New names:
names(data)
[1] "Vehicle" "Equipment" "DateIn" "Time In" "DateOut" "Time Out" "Amount"
[8] "TimeDiff" "Ticket_Type" "Weekday" "...11" "...12" "...13"
print("--------------------------------------------------------------------------------")
[1] "--------------------------------------------------------------------------------"
names(data_s)
[1] "Vehicle" "Equipment" "DateIn" "...4" "DateOut" "...6" "Amount"
[8] "TimeDiff" "Ticket_Type" "Weekday" "...11" "...12" "...13"
data=data[ , c("Vehicle","Equipment","DateIn","Time In","DateOut","Time Out","Amount","TimeDiff","Ticket_Type","Weekday" )]
data_s=data_s[ , c("Vehicle","Equipment","DateIn","...4","DateOut","...6","Amount","TimeDiff","Ticket_Type","Weekday" )]
colnames(data_s)[4] <- "Time In"
colnames(data_s)[6] <- "Time Out"
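Renaming by position works only while the column order stays the same. As a rough sketch, the placeholder names can instead be mapped by name. The `toy` data frame and the `fixes` lookup are made up for illustration; only the `...4` and `...6` names come from the output above.

```r
# Toy stand-in for the weekend sheet: blank headers arrive as "...4" and "...6".
toy <- data.frame(DateIn = "2009-07-04", `...4` = "10:15",
                  DateOut = "2009-07-04", `...6` = "12:40",
                  check.names = FALSE)

# Hypothetical lookup from placeholder name to intended name.
fixes <- c("...4" = "Time In", "...6" = "Time Out")

# Rename only the columns that appear in the lookup, whatever their position.
hit <- names(toy) %in% names(fixes)
names(toy)[hit] <- fixes[names(toy)[hit]]
names(toy)
```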
skim(data)
── Data Summary ────────────────────────
Values
Name data
Number of rows 5000
Number of columns 10
_______________________
Column type frequency:
character 2
numeric 4
POSIXct 4
________________________
Group variables None
── Variable type: character ──────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Ticket_Type           0             1   6   6     0        2          0
2 Weekday               0             1   6   9     0        5          0
── Variable type: numeric ────────────────────────────────────────────
  skim_variable n_missing complete_rate   mean     sd p0 p25 p50 p75 p100 hist
1 Vehicle               0             1   1     0      1   1   1   1    1 ▁▁▇▁▁
2 Equipment             0             1   5.63  0.913  5   5   5   6    8 ▇▅▁▁▂
3 Amount                0             1  36.8  28.5   30  30  30  40  300 ▇▁▁▁▁
4 TimeDiff              0             1 136.   83.9    2  76 122 179  722 ▇▅▁▁▁
── Variable type: POSIXct ────────────────────────────────────────────
  skim_variable n_missing complete_rate min                 max                 median              n_unique
1 DateIn                0             1 2009-07-06 00:00:00 2012-04-20 00:00:00 2010-12-28 00:00:00      696
2 Time In               0             1 1899-12-31 00:01:00 1899-12-31 23:58:00 1899-12-31 16:24:00      880
3 DateOut               0             1 2009-07-06 00:00:00 2012-04-20 00:00:00 2010-12-28 00:00:00      760
4 Time Out              0             1 1899-12-31 00:00:00 1899-12-31 23:59:00 1899-12-31 18:11:00      952
skim(data_s)
── Data Summary ────────────────────────
Values
Name data_s
Number of rows 4995
Number of columns 10
_______________________
Column type frequency:
character 2
numeric 4
POSIXct 4
________________________
Group variables None
── Variable type: character ──────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Ticket_Type           0             1   4   6     0        3          0
2 Weekday               0             1   6   8     0        2          0
── Variable type: numeric ────────────────────────────────────────────
  skim_variable n_missing complete_rate   mean    sd p0 p25 p50 p75 p100 hist
1 Vehicle               0             1   1     0     1   1   1   1    1 ▁▁▇▁▁
2 Equipment             0             1   5.95  1.13  5   5   6   6    9 ▇▆▁▃▁
3 Amount                0             1  62.5  31.4   0  50  50  70  300 ▇▃▁▁▁
4 TimeDiff              0             1 154.   87.7   1  91 141 200  713 ▇▆▁▁▁
── Variable type: POSIXct ────────────────────────────────────────────
  skim_variable n_missing complete_rate min                 max                 median              n_unique
1 DateIn                0             1 2009-07-04 00:00:00 2012-03-31 00:00:00 2010-10-30 00:00:00      279
2 Time In               0             1 1899-12-31 00:01:00 1899-12-31 23:58:00 1899-12-31 16:39:00      852
3 DateOut               0             1 2009-07-04 00:00:00 2012-04-01 00:00:00 2010-10-30 00:00:00      364
4 Time Out              0             1 1899-12-31 00:00:00 1899-12-31 23:59:00 1899-12-31 18:35:00      893
plot(x = data$TimeDiff, y = data$Amount,
xlab = "Length of Stay",
ylab = "Amount",
main = "Weekday"
)
plot(x = data_s$TimeDiff, y = data_s$Amount,
xlab = "Length of Stay",
ylab = "Amount",
main = "Weekend"
)
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ──────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1
✔ readr   2.1.3      ✔ forcats 0.5.2
── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
# Remove the 300 charges (weekday amounts start at 30, so there are no zeros to drop)
data = data %>% filter(Amount <300)
plot(x = data$TimeDiff, y = data$Amount,
xlab = "Length of Stay",
ylab = "Amount",
main = "Weekday"
)
data_s = data_s %>% filter(Amount <300) %>% filter(Amount>0)
plot(x = data_s$TimeDiff, y = data_s$Amount,
xlab = "Length of Stay",
ylab = "Amount",
main = "Weekend"
)
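For reference, the same outlier filter can be written in base R, which also makes it easy to count the rows dropped. This is a small sketch on made-up values; `toy` and `kept` are illustrative names, and 0 and 300 stand in for the amounts treated as outliers above.

```r
# Hypothetical stays (minutes) and charges.
toy <- data.frame(TimeDiff = c(45, 130, 200, 15, 600),
                  Amount   = c(50, 50, 70, 0, 300))

# Keep charges strictly between 0 and 300, as in the dplyr filters above.
kept <- toy[toy$Amount > 0 & toy$Amount < 300, ]

nrow(toy) - nrow(kept)   # number of rows dropped
range(kept$Amount)       # remaining charges
```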
library(plotly)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
Plot of the number of vehicles per day of the week.
library(ggplot2)
ggplot(data, aes(x = Weekday)) +
geom_bar()
ggplot(data_s, aes(x = Weekday)) +
geom_bar()
Let's check the size of vehicles for both weekdays and weekends by listing stays under 181 minutes that were charged more than 30 (weekdays) or 60 (weekends).
data %>% filter(Amount>30 & TimeDiff<181)
data_s %>% filter(Amount>60 & TimeDiff<181)
Adding a column for time of arrival: morning (00:00:00 - 12:00:00), afternoon (12:00:00 - 18:00:00), or evening (18:00:00 - 00:00:00).
# library(data.table)
library(readxl)
# install.packages("skimr")
library("skimr")
# install.packages("tidyverse")
library("tidyverse")
# install.packages("plotly")
library("plotly")
# install.packages("lubridate")
library("lubridate")
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
# install.packages("recommenderlab")
library("recommenderlab")
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
Loading required package: arules
Attaching package: ‘arules’
The following object is masked from ‘package:dplyr’:
recode
The following objects are masked from ‘package:base’:
abbreviate, write
Loading required package: proxy
Attaching package: ‘proxy’
The following object is masked from ‘package:Matrix’:
as.matrix
The following objects are masked from ‘package:stats’:
as.dist, dist
The following object is masked from ‘package:base’:
as.matrix
Loading required package: registry
Registered S3 methods overwritten by 'registry':
method from
print.registry_field proxy
print.registry_entry proxy
class(data$`Time In`)
[1] "POSIXct" "POSIXt"
#hour(data$`Time In`)
# Label each arrival as Morning, Afternoon or Evening from the hour of `Time In`
data = data %>% mutate(Timing = case_when(
  hour(`Time In`) >= 0  & hour(`Time In`) < 12 ~ "Morning",
  hour(`Time In`) >= 12 & hour(`Time In`) < 18 ~ "Afternoon",
  hour(`Time In`) >= 18 & hour(`Time In`) < 24 ~ "Evening"))
data_s = data_s %>% mutate(Timing = case_when(
  hour(`Time In`) >= 0  & hour(`Time In`) < 12 ~ "Morning",
  hour(`Time In`) >= 12 & hour(`Time In`) < 18 ~ "Afternoon",
  hour(`Time In`) >= 18 & hour(`Time In`) < 24 ~ "Evening"))
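The same three-way split can be sketched in base R with `cut()`, which keeps the boundaries in one place. The arrival times below are made up, placed on the 1899-12-31 placeholder date that appears in the summary above; `time_in`, `hr` and `timing` are illustrative names.

```r
# Hypothetical arrival times on the placeholder date used for time-only cells.
time_in <- as.POSIXct(c("1899-12-31 08:30:00", "1899-12-31 12:00:00",
                        "1899-12-31 17:59:00", "1899-12-31 18:00:00"),
                      tz = "UTC")

# Extract the hour, then bin into [0,12), [12,18), [18,24).
hr <- as.integer(format(time_in, "%H"))
timing <- cut(hr, breaks = c(0, 12, 18, 24), right = FALSE,
              labels = c("Morning", "Afternoon", "Evening"))
table(timing)
```

With `right = FALSE` each interval is closed on the left, so 12:00 counts as afternoon and 18:00 as evening, matching the conditions above.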
tw = data %>%
group_by(Timing) %>%
summarise(count =n())
tw
fig <- plot_ly(tw, labels = ~Timing, values = ~count, type = 'pie')
fig <- fig %>% layout(title = 'Weekday Timing',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
On weekdays, most people arrive in the afternoon.
tw = data_s %>%
group_by(Timing) %>%
summarise(count =n())
fig <- plot_ly(tw, labels = ~Timing, values = ~count, type = 'pie')
fig <- fig %>% layout(title = 'Weekend Timing',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
On Saturdays and Sundays, most people also arrive in the afternoon.
Distribution of time spent by vehicles
library(ggplot2)
## Basic histogram of TimeDiff (weekdays). Each bin is 10 minutes wide.
ggplot(data, aes(x=TimeDiff)) + geom_histogram(binwidth=10)
# Draw with black outline, white fill
ggplot(data, aes(x=TimeDiff)) +
geom_histogram(binwidth=10, colour="black", fill="white")
# Density curve
ggplot(data, aes(x=TimeDiff)) + geom_density()
# Histogram overlaid with kernel density curve
ggplot(data, aes(x=TimeDiff)) +
geom_histogram(aes(y=..density..), # Histogram with density instead of count on y-axis
binwidth=10,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") # Overlay with transparent density plot
Checking Skew Value
install.packages("moments")
library(moments)
skewness(data$TimeDiff)
[1] 1.40518
If the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed. If the skewness is less than -1 or greater than 1, the data are highly skewed. At 1.41, the weekday stay times are highly right-skewed.
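To make the rule of thumb concrete, here is a minimal base-R sketch that computes moment-based skewness (third central moment divided by the second central moment raised to 1.5) and applies the thresholds above. `skew` and `classify_skew` are illustrative helpers, not functions from the `moments` package.

```r
# Moment-based skewness using population (divide-by-n) moments.
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Apply the rule-of-thumb thresholds to a skewness value.
classify_skew <- function(s) {
  a <- abs(s)
  if (a <= 0.5) {
    "fairly symmetrical"
  } else if (a <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

classify_skew(1.40518)                    # value reported for the weekday data
classify_skew(skew(c(1, 2, 3, 4, 5)))     # a symmetric toy vector
```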
DISTRIBUTION FOR WEEKEND
## Basic histogram of TimeDiff (weekends). Each bin is 10 minutes wide.
ggplot(data_s, aes(x=TimeDiff)) + geom_histogram(binwidth=10)
# Draw with black outline, white fill
ggplot(data_s, aes(x=TimeDiff)) +
geom_histogram(binwidth=10, colour="black", fill="white")
# Density curve
ggplot(data_s, aes(x=TimeDiff)) + geom_density()
# Histogram overlaid with kernel density curve
ggplot(data_s, aes(x=TimeDiff)) +
geom_histogram(aes(y=..density..), # Histogram with density instead of count on y-axis
binwidth=10,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") # Overlay with transparent density plot
skewness(data_s$TimeDiff)
[1] 1.056578
Q.4
Do people tend to stay longer during weekends?
print("Average stay During weekday")
[1] "Average stay During weekday"
print(mean(data$TimeDiff))
[1] 135.4679
print("Average stay During weekend")
[1] "Average stay During weekend"
print(mean(data_s$TimeDiff))
[1] 151.9696
Yes, people tend to stay longer during weekends: the mean stay is about 152 minutes, against about 135 minutes on weekdays.
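Comparing two means by eye does not show whether the gap could be due to chance. One possible check, sketched here on synthetic data only, is a one-sided Welch t-test from base R. The gamma-distributed vectors below are invented stand-ins for right-skewed stay times and are not the parking data.

```r
set.seed(1)
# Synthetic right-skewed stay times in minutes (assumed shapes and scales).
weekday <- rgamma(500, shape = 2.6, scale = 52)
weekend <- rgamma(500, shape = 3.0, scale = 51)

# H1: the weekend mean is greater than the weekday mean.
res <- t.test(weekend, weekday, alternative = "greater")
res$estimate   # the two sample means
res$p.value    # a small value would support a longer weekend stay
```

On the real data the call would take `data_s$TimeDiff` and `data$TimeDiff` in place of the synthetic vectors.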
Q.5
# Regroup weekend arrivals with the morning window extended to 14:00
data_s = data_s %>% mutate(Timing = case_when(
  hour(`Time In`) >= 0  & hour(`Time In`) < 14 ~ "Morning",
  hour(`Time In`) >= 14 & hour(`Time In`) < 18 ~ "Afternoon",
  hour(`Time In`) >= 18 & hour(`Time In`) < 24 ~ "Evening"))
tw = data_s %>%
group_by(Timing) %>%
summarise(count =n())
fig <- plot_ly(tw, labels = ~Timing, values = ~count, type = 'pie')
fig <- fig %>% layout(title = 'Weekend Timing',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig
Q.6 If the charge is 20 for the first 2 hours and 10 for every additional hour, what is the financial effect?
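# New tariff: 20 for the first 120 minutes, plus 10 for each additional
# hour or part of an hour. Works on a whole vector of stay lengths.
new_price <- function(minutes) {
  extra = pmax(as.integer(minutes) - 120, 0)
  20 + 10 * ceiling(extra / 60)
}
data$newprice = new_price(data$TimeDiff)
# Revenue under the current charges minus revenue under the new tariff
sum(data$Amount)-sum(data$newprice)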
[1] 24440
On the filtered weekday data, the new tariff would mean a loss of 24,440 in revenue.
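As a worked example of the proposed tariff, the sketch below prices a few hypothetical stay lengths. `tariff` and `stays` are illustrative names; the rule itself (20 for the first 120 minutes, then 10 per started hour) is the one stated in the question.

```r
# Proposed tariff applied to a vector of stay lengths in minutes.
tariff <- function(minutes) {
  extra <- pmax(minutes - 120, 0)     # minutes beyond the first two hours
  20 + 10 * ceiling(extra / 60)       # each started extra hour costs 10
}

stays <- c(76, 120, 121, 180, 181, 300)
data.frame(minutes = stays, new_charge = tariff(stays))
# new_charge: 20, 20, 30, 30, 40, 50
```

A stay of 121 minutes already pays for a third hour, while 180 minutes still falls inside it, which is why both are charged 30.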
Q.8 Analyst's claim (alternative hypothesis): the average occupancy on weekdays is not at most 2 hours. Alpha = 0.05. The population variance is taken to equal the sample variance.
data_sample=sample_n(data, 500)
## Basic histogram of TimeDiff for the 500-row sample. Each bin is 10 minutes wide.
ggplot(data_sample, aes(x=TimeDiff)) + geom_histogram(binwidth=10)
# Draw with black outline, white fill
ggplot(data_sample, aes(x=TimeDiff)) +
geom_histogram(binwidth=10, colour="black", fill="white")
# Density curve
ggplot(data_sample, aes(x=TimeDiff)) + geom_density()
# Histogram overlaid with kernel density curve
ggplot(data_sample, aes(x=TimeDiff)) +
geom_histogram(aes(y=..density..), # Histogram with density instead of count on y-axis
binwidth=10,
colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") # Overlay with transparent density plot
install.packages("BSDA")
library(BSDA)
H null: mean <= 120 (claimed value)
H alternative: mean > 120 (opposite)
The alternative hypothesis in each case indicates the direction of divergence of the population mean for x (or difference of means for x and y) from mu (i.e., “greater”, “less”, “two.sided”).
data_sample=sample_n(data, 500)
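# sigma.x is the population standard deviation, so pass sd() rather than var()
z.test(x=data_sample$TimeDiff, mu=120, sigma.x=sd(data$TimeDiff), alternative = "greater")
The run recorded for this notebook passed var(data$TimeDiff) as sigma.x, which gave a test statistic of 0.054417 and a p-value of 0.4783. Because the statistic divides by sigma.x, using the standard deviation (about 84 minutes in the summary above) makes it larger by that same factor, well above the 1.645 critical value for alpha = .05, so the p-value falls below .05. The sample is random, so exact values change from run to run.
Since the p-value is less than .05, we have sufficient evidence to reject the null hypothesis: the average weekday stay is longer than 2 hours.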
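The `sigma.x` argument of `z.test` is a standard deviation, and the choice between `sd()` and `var()` changes the statistic by a large factor. The base-R sketch below computes a one-sided, one-sample z statistic by hand on synthetic data to show the effect. `z_test_greater`, `population` and `smp` are made-up names, and the gamma draw is an invented stand-in for the stay times.

```r
# One-sample z-test for H1: mean > mu, with a known population sigma.
z_test_greater <- function(x, mu, sigma) {
  z <- (mean(x) - mu) / (sigma / sqrt(length(x)))
  c(z = z, p_value = pnorm(z, lower.tail = FALSE))
}

set.seed(42)
population <- rgamma(5000, shape = 2.6, scale = 52)  # synthetic stay times
smp <- sample(population, 500)

with_sd  <- z_test_greater(smp, 120, sd(population))   # intended usage
with_var <- z_test_greater(smp, 120, var(population))  # variance passed by mistake
rbind(with_sd, with_var)
```

The two statistics differ by exactly a factor of `sd(population)`, so passing the variance shrinks z towards zero and pushes the p-value towards 0.5.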
Thank you !!!!!